Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
Template
spark = (
SparkSession.builder
.master("local")
.appName("Section 4 - More Comfortable with SQL?")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
df = spark.read.csv(path, header=True)
df.toPandas()
| | id | species_id | name | birthday | color |
|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | None |
Register DataFrame as a SQL Table
df.createOrReplaceTempView("pets")
What Happened?
The first step in making a `df` queryable with SQL is to register it as a SQL table (a temporary view). This particular function will replace any previously registered local table named `pets` as a result. There are other functions that will register a DataFrame with slightly different behavior; you can check the reference docs if this isn't the desired behavior: docs
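If the replace-on-conflict behavior isn't what you want, a couple of the other registration methods are sketched below; this is only a quick illustration, and the view names `pets_temp` and `pets_global` are made up for the example.

```python
# Sketch of alternative registration methods (view names are illustrative).

# createTempView raises an error if a temp view with this name already
# exists, instead of silently replacing it.
df.createTempView("pets_temp")

# createGlobalTempView registers a view shared across sessions of the same
# Spark application; it lives in the reserved `global_temp` database.
df.createGlobalTempView("pets_global")
spark.sql("SELECT * FROM global_temp.pets_global").show()
```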
Let's Write a SQL Query!
df_2 = spark.sql("""
SELECT
*
FROM pets
WHERE name = 'Argus'
""")
df_2.toPandas()
| | id | species_id | name | birthday | color |
|---|---|---|---|---|---|
| 0 | 2 | 3 | Argus | 2016-11-22 10:05:10 | None |
What Happened?
Once your `df` is registered, call the `sql` function on your `spark session` object. It takes a SQL `string` as an input and outputs a new `df`.
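Because `spark.sql` just returns another DataFrame, you can keep transforming the result or register it as a new view. A minimal sketch (the view name `argus_pets` is made up for illustration):

```python
# The output of spark.sql() is an ordinary DataFrame, so it can be chained
# or registered as yet another view ("argus_pets" is an illustrative name).
argus_df = spark.sql("SELECT * FROM pets WHERE name = 'Argus'")
argus_df.createOrReplaceTempView("argus_pets")

spark.sql("SELECT count(*) AS num_rows FROM argus_pets").show()
```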
Conclusion?
If you're more comfortable with writing `sql` than python/spark code, then you can do so with a spark `df`! We do this by:

- Registering the `df` with `df.createOrReplaceTempView('table')`.
- Calling the `sql` function on your `spark session` with a `sql string` as an input.
- You're done! (A minimal recap is sketched below.)
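Putting both steps together, here is a minimal end-to-end sketch; it assumes the `spark` session and `df` created earlier in this section, and the query itself is just an example.

```python
# Minimal recap, assuming `spark` and `df` from the cells above.
df.createOrReplaceTempView("pets")

result = spark.sql("""
    SELECT name, color
    FROM pets
    WHERE color IS NOT NULL
""")
result.toPandas()
```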